A Gradient-Based Boosting Algorithm for Regression Problems

Authors

  • Richard S. Zemel
  • Toniann Pitassi
Abstract

In adaptive boosting, several weak learners trained sequentially are combined to boost the overall algorithm performance. Recently, adaptive boosting methods for classification problems have been derived as gradient descent algorithms. This formulation justifies key elements and parameters in the methods, all chosen to optimize a single common objective function. We propose an analogous formulation for adaptive boosting of regression problems, utilizing a novel objective function that leads to a simple boosting algorithm. We prove that this method reduces training error, and compare its performance to other regression methods.

The aim of boosting algorithms is to "boost" the small advantage that a hypothesis produced by a weak learner can achieve over random guessing, by using the weak learning procedure several times on a sequence of carefully constructed distributions. Boosting methods, notably AdaBoost (Freund & Schapire, 1997), are simple yet powerful algorithms that are easy to implement and yield excellent results in practice. Two crucial elements of boosting algorithms are the way in which a new distribution is constructed for the learning procedure to produce the next hypothesis in the sequence, and the way in which hypotheses are combined to produce a highly accurate output. Both of these involve a set of parameters whose values appeared to be determined in an ad hoc manner. Recently, boosting algorithms have been derived as gradient descent algorithms (Breiman, 1997; Schapire & Singer, 1998; Friedman et al., 1999; Mason et al., 1999). These formulations justify the parameter values as all serving to optimize a single common objective function. These optimization formulations of boosting, originally developed for classification problems, have recently been applied to regression problems. However, key properties of these regression boosting methods deviate significantly from the classification boosting approach.
We propose a new boosting algorithm for regression problems, also derived from a central objective function, which retains these properties. In this paper, we describe the original boosting algorithm and summarize boosting methods for regression. We present our method and provide a simple proof that elucidates conditions under which convergence on training error can be guaranteed. We propose a probabilistic framework that clarifies the relationship between various optimization-based boosting methods. Finally, we summarize empirical comparisons between our method and others on some standard problems.

1 A Brief Summary of Boosting Methods

Adaptive boosting methods are simple modular algorithms that operate as follows. Let $g : X \to Y$ be the function to be learned, where the label set $Y$ is finite, typically binary-valued. The algorithm uses a learning procedure which has access to $n$ training examples, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, drawn randomly from $X \times Y$ according to distribution $D$; it outputs a hypothesis $f : X \to Y$, whose error is the expected value of a loss function on $f(x), g(x)$, where $x$ is chosen according to $D$. Given $\epsilon, \delta > 0$ and access to random examples, a strong learning procedure outputs, with probability $1 - \delta$, a hypothesis with error at most $\epsilon$, with running time polynomial in $1/\epsilon$, $1/\delta$ and the number of examples. A weak learning procedure satisfies the same conditions, but its error need only be slightly better than random guessing. Schapire (1990) showed that any weak learning procedure, denoted WeakLearn, can be efficiently transformed ("boosted") into a strong learning procedure. The AdaBoost algorithm achieves this by calling WeakLearn multiple times, in a sequence of $T$ stages, each time presenting it with a different distribution over a fixed training set, and finally combining all of the hypotheses. The algorithm maintains a weight $w_t^i$ for each training example $i$ at stage $t$, and a distribution $D_t$ is computed by normalizing these weights.
The algorithm loops through these steps:

1. At stage $t$, the distribution $D_t$ is given to WeakLearn, which generates a hypothesis $f_t$. The error rate $\epsilon_t$ of $f_t$ w.r.t. $D_t$ is:
$$\epsilon_t = \sum_{i : f_t(x_i) \neq y_i} w_t^i \Big/ \sum_{i=1}^{n} w_t^i$$

2. The new training distribution is obtained from the new weights:
$$w_{t+1}^i = w_t^i \left( \epsilon_t / (1 - \epsilon_t) \right)^{1 - |f_t(x_i) - y_i|}$$

After $T$ stages, a test example $x$ will be classified by a combined weighted-majority hypothesis: $\hat{y} = \mathrm{sgn}\left( \sum_{t=1}^{T} c_t f_t(x) \right)$. Each combination coefficient $c_t = \log\left( (1 - \epsilon_t) / \epsilon_t \right)$ takes into account the accuracy of hypothesis $f_t$ with respect to its distribution.

The optimization approach derives these equations as all minimizing a common objective function $J$, the expected error of the combined hypotheses, estimated from the training set. The new hypothesis is the step in function space in the direction of steepest descent of this objective. For example, if $J = \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\sum_t y_i c_t f_t(x_i) \right)$, then the cost after $T$ rounds is the cost after $T-1$ rounds times the cost of hypothesis $f_T$:
$$J(T) = \frac{1}{n} \sum_{i=1}^{n} \exp\left( -\sum_{t=1}^{T-1} y_i c_t f_t(x_i) \right) \exp\left( -y_i c_T f_T(x_i) \right) = \sum_i w_T^i \exp\left( -y_i c_T f_T(x_i) \right)$$
so training $f_T$ to minimize $J(T)$ amounts to minimizing the cost on a weighted training distribution. Similarly, the training distribution is formed by normalizing updated weights: $w_{t+1}^i = w_t^i \exp(-y_i c_t f_t(x_i)) = w_t^i \exp(s_t^i c_t)$, where $s_t^i = 1$ if $f_t(x_i) \neq y_i$, else $s_t^i = -1$.

Note that because the objective function $J$ is multiplicative in the costs of the hypotheses, a key property follows: the objective for each hypothesis is formed simply by re-weighting the training distribution. This boosting algorithm applies to binary classification problems, but it does not readily generalize to regression problems. Intuitively, regression problems present special difficulties because hypotheses may not just be right or wrong, but can be a little wrong or very wrong.
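As a concrete illustration, the AdaBoost loop above can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: `weak_learn` is a hypothetical threshold-stump learner introduced here for illustration, and labels are taken in {-1, +1}, so the weight update down-weights correctly classified examples by the factor $\epsilon_t/(1-\epsilon_t)$, which is equivalent to the update written above for 0/1 labels.

```python
# Minimal AdaBoost sketch for binary labels y in {-1, +1}.
# "weak_learn" is a hypothetical stand-in for any weak learning procedure
# that accepts a weighted training distribution.
import numpy as np

def weak_learn(X, y, dist):
    """Hypothetical weak learner: best single-feature threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = dist[pred != y].sum()   # weighted error w.r.t. dist
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda X: sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, T):
    n = len(y)
    w = np.ones(n) / n
    hyps, coeffs = [], []
    for t in range(T):
        dist = w / w.sum()                    # D_t: normalized weights
        f = weak_learn(X, y, dist)
        eps = max(dist[f(X) != y].sum(), 1e-12)
        if eps >= 0.5:                        # no better than random: stop
            break
        c = np.log((1 - eps) / eps)           # combination coefficient c_t
        w = w * (eps / (1 - eps)) ** (f(X) == y)  # down-weight correct examples
        hyps.append(f); coeffs.append(c)
    # weighted-majority vote of the accepted hypotheses
    return lambda X: np.sign(sum(c * f(X) for c, f in zip(coeffs, hyps)))
```

On a linearly separable toy set, a single stump already suffices and the combined vote reproduces the labels.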
Recently a spate of clever optimization-based boosting methods have been proposed for regression (Duffy & Helmbold, 2000; Friedman, 1999; Karakoulas & Shawe-Taylor, 1999; Rätsch et al., 2000). While these methods involve diverse objectives and optimization approaches, they are alike in that new hypotheses are formed not by simply changing the example weights, but instead by modifying the target values. As such they can be viewed as forms of forward stage-wise additive models (Hastie & Tibshirani, 1990), which produce hypotheses sequentially to reduce residual error. We study a simple example of this approach, in which hypothesis $T$ is trained not to produce the target output $y_i$ on a given case $i$, but instead to fit the current residual, $r_T^i$, where $r_T^i = y_i - \sum_{t=1}^{T-1} c_t f_t(x_i)$. Note that this approach develops a series of hypotheses all based on optimizing a common objective, but it deviates from standard boosting in that the distribution of examples is not used to control the generation of hypotheses, and each hypothesis is not trained to learn the same function.

2 An Objective Function for Boosting Regression Problems

We derive a boosting algorithm for regression from a different objective function. This algorithm is similar to the original classification boosting method in that the objective is multiplicative in the hypotheses' costs, which means that the target outputs are not altered after each stage; rather, the objective for each hypothesis is formed simply by re-weighting the training distribution. The objective function is:

$$J_T = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=1}^{T} c_t^{-1/2} \right) \exp\left[ \sum_{t=1}^{T} c_t (f_t(x_i) - y_i)^2 \right] \qquad (1)$$

Here, training hypothesis $T$ to minimize $J_T$, the cost after $T$ stages, amounts to minimizing the exponentiated squared error on a weighted training distribution:

$$J_T = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( \prod_{t=1}^{T-1} c_t^{-1/2} \right) \exp\left( \sum_{t=1}^{T-1} c_t (f_t(x_i) - y_i)^2 \right) \right] c_T^{-1/2} \exp\left( c_T (f_T(x_i) - y_i)^2 \right) = \sum_{i=1}^{n} w_T^i \, c_T^{-1/2} \exp\left( c_T (f_T(x_i) - y_i)^2 \right)$$

We update each weight by multiplying by its respective error, and form the training distribution for the next hypothesis by normalizing these updated weights. In the standard AdaBoost algorithm, the combination coefficient $c_t$ can be analytically determined by solving $\partial J_t / \partial c_t = 0$ for $c_t$. Unfortunately, one cannot analytically determine the combination coefficient $c_t$ in our algorithm, but a simple line search can be used to find the value of $c_t$ that minimizes the cost $J_t$. We limit $c_t$ to be between 0 and 1. Finally, optimizing $J$ with respect to $y$ produces a simple linear combination rule for the estimate: $\hat{y} = \sum_t c_t f_t(x) / \sum_t c_t$.

We introduce a constant $\tau$ as a threshold used to demarcate correct from incorrect responses. This threshold is the single parameter of the algorithm that must be chosen in a problem-dependent manner. It is used to judge when the performance of a new hypothesis warrants its inclusion: $\epsilon_t = \sum_i p_t^i \exp\left[ (f_t(x_i) - y_i)^2 - \tau \right] < 1$.

The algorithm can be summarized as follows:

New Boosting Algorithm
1. Input: training set examples $(x_1, y_1), \ldots, (x_n, y_n)$ with $y \in \mathbb{R}$; WeakLearn: a learning procedure that produces a hypothesis $f_t(x)$ whose accuracy on the training set is judged according to $J$
2. Choose the initial distribution $p_1(x_i) = p_1^i = w_1^i = \frac{1}{n}$
3. Iterate:
   - Call WeakLearn -- minimize $J_t$ with distribution $p_t$
   - Accept iff $\epsilon_t = \sum_i p_t^i \exp\left[ (f_t(x_i) - y_i)^2 - \tau \right] < 1$
   - Set $0 \le c_t \le 1$ to minimize $J_t$ (using line search)
   - Update the training distribution: $w_{t+1}^i = w_t^i \, c_t^{-1/2} \exp\left( c_t (f_t(x_i) - y_i)^2 \right)$, $\; p_{t+1}^i = w_{t+1}^i \big/ \sum_{j=1}^{n} w_{t+1}^j$
4. Estimate the output $y$ on input $x$: $\hat{y} = \sum_t c_t f_t(x) \big/ \sum_t c_t$

3 Proof of Convergence

Theorem: Assume that for all $t \le T$, hypothesis $t$ makes error $\epsilon_t$ on its distribution. If the combined output $\hat{y}$ is considered to be in error iff $(\hat{y} - y)^2 > \tau$, then the output of the boosting algorithm (after $T$ stages) will have error at most $\epsilon$, where
$$\epsilon = P\left[ (\hat{y}_i - y_i)^2 > \tau \right] \le \prod_{t=1}^{T} \epsilon_t \exp\left[ \tau \left( T - \sum_{t=1}^{T} c_t \right) \right].$$
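The algorithm summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `weak_learn` here is a deliberately weak, hypothetical base learner (it outputs the weighted mean of the targets), and the line search over $c_t \in [0, 1]$ is a simple grid search.

```python
# Sketch of the regression boosting algorithm: weights are multiplied by
# c_t^(-1/2) * exp(c_t * err^2), c_t in [0, 1] is chosen by a grid search
# on the stage cost J_t, and a hypothesis is accepted only if eps_t < 1.
import numpy as np

def weak_learn(X, y, p):
    """Hypothetical weak learner: constant prediction = weighted mean."""
    m = float(np.sum(p * y))
    return lambda X: np.full(len(X), m)

def boost_regression(X, y, T, tau=0.1):
    n = len(y)
    w = np.ones(n) / n
    hyps, coeffs = [], []
    for t in range(T):
        p = w / w.sum()                       # training distribution p_t
        f = weak_learn(X, y, p)
        err2 = (f(X) - y) ** 2
        eps = float(np.sum(p * np.exp(err2 - tau)))
        if eps >= 1.0:                        # acceptance test: eps_t < 1
            break
        # line search for c_t in [0, 1] minimizing the stage cost J_t
        grid = np.linspace(1e-3, 1.0, 200)
        J = [c ** -0.5 * np.sum(w * np.exp(c * err2)) for c in grid]
        c = float(grid[int(np.argmin(J))])
        w = w * c ** -0.5 * np.exp(c * err2)  # weight update
        hyps.append(f); coeffs.append(c)
    def predict(X):
        # combination rule (assumes at least one hypothesis was accepted)
        return sum(c * f(X) for c, f in zip(coeffs, hyps)) / sum(coeffs)
    return predict
```

Note that, as in the algorithm, the example weights grow fastest on the examples each hypothesis fits worst, so the next distribution concentrates on the hard cases.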
Proof: We follow the approach used in the AdaBoost proof (Freund & Schapire, 1997). We show that the sum of the weights at stage $T$ is bounded above by a constant times the product of the $\epsilon_t$'s, while at the same time, for each input $i$ that is incorrect, its corresponding weight $w_T^i$ at stage $T$ is significant.

$$\sum_{i=1}^{n} w_{T+1}^i = \sum_i w_T^i \, c_T^{-1/2} \exp\left[ c_T (f_T(x_i) - y_i)^2 \right] \le c_T^{-1/2} \, \epsilon_T \exp(\tau) \sum_i w_T^i \le \prod_{t=1}^{T} c_t^{-1/2} \exp(\tau) \, \epsilon_t$$

The inequality holds because $0 \le c_t \le 1$. We now compute the new weights:

$$\sum_t c_t (f_t(x_i) - y_i)^2 = \left[ \sum_t c_t \right] \left[ \mathrm{Var}(f^i) + (\hat{y}_i - y_i)^2 \right] \ge \left[ \sum_t c_t \right] (\hat{y}_i - y_i)^2$$

where $\hat{y}_i = \sum_t c_t f_t(x_i) / \sum_t c_t$ and $\mathrm{Var}(f^i) = \sum_t c_t (f_t(x_i) - \hat{y}_i)^2 / \sum_t c_t$. Thus,

$$w_{T+1}^i = \left( \prod_{t=1}^{T} c_t^{-1/2} \right) \exp\left( \sum_{t=1}^{T} c_t (f_t(x_i) - y_i)^2 \right) \ge \left( \prod_{t=1}^{T} c_t^{-1/2} \right) \exp\left( \left[ \sum_{t=1}^{T} c_t \right] (\hat{y}_i - y_i)^2 \right)$$

Now consider an example input $k$ such that the final answer is an error. Then, by definition, $(\hat{y}_k - y_k)^2 > \tau \Rightarrow w_{T+1}^k \ge \left( \prod_t c_t^{-1/2} \right) \exp\left( \tau \sum_t c_t \right)$. If $\epsilon$ is the total error rate of the combined output, then:

$$\sum_i w_{T+1}^i \ge \sum_{k : k \text{ error}} w_{T+1}^k \ge \epsilon \left( \prod_{t=1}^{T} c_t^{-1/2} \right) \exp\left( \tau \sum_{t=1}^{T} c_t \right)$$

$$\epsilon \le \left( \sum_i w_{T+1}^i \right) \left( \prod_{t=1}^{T} c_t^{1/2} \right) \exp\left[ -\tau \sum_t c_t \right] \le \prod_{t=1}^{T} \epsilon_t \exp\left[ \tau \left( T - \sum_t c_t \right) \right]$$

Note that, as in the binary AdaBoost theorem, no assumptions are made here about $\epsilon_t$, the error rate of the individual hypotheses. If all $\epsilon_t = \epsilon^* < 1$, then $\epsilon < (\epsilon^*)^T \exp\left[ \tau (T - \sum_t c_t) \right]$, which is exponentially decreasing as long as $c_t \to 1$.

4 Comparing the Objectives

We can compare the objectives by adopting a probabilistic framework. We associate a probability distribution with the output of each hypothesis on input $x$, and combine them to form a consensus model $M$ by multiplying the distributions: $g(y|x, M) \propto \prod_t p_t(y|x, \theta_t)$, where $\theta_t$ are parameters specific to hypothesis $t$. If each hypothesis $t$ produces a single output $f_t(x)$ and has confidence $c_t$ assigned to it, then $p_t(y|x, \theta_t)$ can be considered a Gaussian with mean $f_t(x)$ and variance $1/c_t$:

$$g(y|x, M) = k \left[ \prod_t c_t^{1/2} \right] \exp\left[ -\sum_t c_t (y - f_t(x))^2 \right]$$

Model parameters can be tuned to maximize $g(y^*|x, M)$, where $y^*$ is the target for $x$; our objective (Eq. 1) is the expected value of the reciprocal of $g(y^*|x, M)$.
An alternative objective can be derived by first normalizing $g(y|x, M)$:

$$p(y|x, M) \equiv \frac{g(y|x, M)}{\int_{y'} g(y'|x, M) \, dy'} = \frac{\prod_t p_t(y|x, \theta_t)}{\int_{y'} \prod_t p_t(y'|x, \theta_t) \, dy'}$$

This probability model underlies the product-of-experts model (Hinton, 2000) and the logarithmic opinion pool (Bordley, 1982). If we again assume $p_t(y|x, \theta_t) \sim N(f_t(x), c_t^{-1})$, then $p(y|x, M)$ is a Gaussian, with mean $\bar{f}(x) = \sum_t c_t f_t(x) \big/ \sum_t c_t$ and inverse variance $c = \sum_t c_t$. The objective for this model is:

$$J_R = -\log p(y^*|x, M) = c \left[ y^* - \bar{f}(x) \right]^2 - \frac{1}{2} \log c \qquad (2)$$

This objective corresponds to a type of residual-fitting algorithm. If $r(x) \equiv \left[ y^* - \bar{f}(x) \right]$, and $\{c_t\}$ for $t < T$ are assumed frozen, then training $f_T$ to minimize $J_R$ is achieved by using $r(x)$ as a target.

These objectives can be further compared w.r.t. a bias-variance decomposition (Geman et al., 1992; Heskes, 1998). The main term in our objective can be re-expressed:

$$\sum_t c_t \left[ y^* - f_t(x) \right]^2 = \sum_t c_t \left[ y^* - \bar{f}(x) \right]^2 + \sum_t c_t \left[ f_t(x) - \bar{f}(x) \right]^2 = \text{bias} + \text{variance}$$

Meanwhile, the main term of $J_R$ corresponds to the bias term. Hence a new hypothesis can minimize $J_R$ by having low error ($f_t(x) = y^*$), or with a deviant (ambiguous) response ($f_t(x) \neq \bar{f}(x)$) (Krogh & Vedelsby, 1995). Thus our objective attempts to minimize the average error of the models, while the residual-fitting objective minimizes the error of the average model.

Figure 1: Generalization results for our gradient-based boosting algorithm, compared to the residual-fitting and mixture-of-experts algorithms. Left: Test problem F1; Right: Boston housing data. Normalized mean-squared error is plotted against the number of stages of boosting (or number of experts for the mixture-of-experts).

5 Empirical Tests of Algorithm

We report results comparing the performance of our new algorithm with two other algorithms. The first is a residual-fitting algorithm based on the $J_R$ objective (Eq. 2), but with the coefficients not normalized.
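The two facts used above — that the normalized product of Gaussian hypotheses has mean $\sum_t c_t f_t(x) / \sum_t c_t$ and inverse variance $\sum_t c_t$, and that the weighted error splits exactly into the bias and variance terms — can be checked numerically. The values below are illustrative, not from the paper.

```python
# Numeric check of the product-of-Gaussians combination and the
# bias-variance decomposition, for one input x with illustrative values.
import numpy as np

c = np.array([0.9, 0.5, 0.7])      # confidences (inverse variances) per hypothesis
f = np.array([1.2, 0.8, 1.1])      # hypothesis outputs f_t(x) at this input
y = 1.0                            # target y*

f_bar = np.sum(c * f) / np.sum(c)  # mean of the normalized product model
inv_var = np.sum(c)                # its inverse variance

lhs = np.sum(c * (y - f) ** 2)                 # sum_t c_t [y* - f_t(x)]^2
bias = np.sum(c) * (y - f_bar) ** 2            # "bias" term
variance = np.sum(c * (f - f_bar) ** 2)        # "variance" (ambiguity) term
assert np.isclose(lhs, bias + variance)        # decomposition holds exactly
```

The cross term vanishes identically because $\sum_t c_t (\bar{f}(x) - f_t(x)) = 0$ by the definition of $\bar{f}(x)$, so the identity is exact, not approximate.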
The second algorithm is a version of the mixture-of-experts algorithm (Jacobs et al., 1991). Here the hypotheses (or experts) are trained simultaneously. In the standard mixture-of-experts the combination coefficients depend on the input; to make this model comparable to the others, we allowed each expert one input-independent, adaptable coefficient. This algorithm provides a good alternative to the greedy stage-wise methods, in that the experts are trained simultaneously to collectively fit the data.

We evaluate these algorithms on two problems. The first is the nonlinear prediction problem F1 (Friedman, 1991), which has 10 independent input variables uniform in $[0, 1]$:

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + n$$

where $n$ is a random variable drawn from a mean-zero, unit-variance normal distribution. In this problem, only five input variables ($x_1$ to $x_5$) have predictive value. We rescaled the target values $y$ to be in $[0, 3]$. We used 400 training examples, and 100 validation and test examples.

The second test problem is the standard Boston Housing problem. Here there are 506 examples and twelve continuous input variables. We scaled the input variables to be in $[0, 1]$, and the outputs to be in $[0, 5]$. We used 400 of the examples for training, 50 for validation, and the remainder to test generalization.

We used neural networks as the hypotheses and back-propagation as the learning procedure to train them. Each network had a layer of tanh() units between the input units and a single linear output. For each algorithm, we used early stopping with a validation set in order to reduce over-fitting in the hypotheses. One finding was that the other algorithms out-performed ours when the hypotheses were simple: when the weak learners had only one or two hidden nodes, the residual-fitting algorithm reduced test error. With more hidden nodes, the relative performance of our algorithm improved.
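The F1 benchmark described above is straightforward to reproduce. The following sketch assumes 100 examples each for validation and test (the text gives 400/100) and omits the $[0, 3]$ target rescaling, which is noted in a comment.

```python
# Sketch of the F1 benchmark (Friedman, 1991): 10 uniform inputs, of which
# only x1..x5 carry signal, plus mean-zero unit-variance Gaussian noise.
import numpy as np

def friedman_f1(n, rng):
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + rng.normal(0.0, 1.0, size=n))     # noise term n
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = friedman_f1(400, rng)
X_val,   y_val   = friedman_f1(100, rng)
X_test,  y_test  = friedman_f1(100, rng)
# The experiments additionally rescale the targets y into [0, 3].
```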
Figure 1 shows average results for three-hidden-unit networks over 20 runs of each algorithm on the two problems, with examples randomly assigned to the three sets on each run. The results were consistent for different values of $\tau$ in our algorithm; here $\tau = 0.1$. Overall, the residual-fitting algorithm exhibited more over-fitting than our method. Over-fitting in these approaches may be tempered: a regularization technique known as shrinkage, which scales the combination coefficients by a fractional parameter, has been found to improve generalization in gradient boosting applications to classification (Friedman, 1999). Finally, the mixture-of-experts algorithm generally out-performed the sequential training algorithm. A drawback of this method is the need to specify the number of hypotheses in advance; however, given that number, simultaneous training is likely less prone to local minima than the sequential approaches.
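Shrinkage, as mentioned above, is simple to add to a stage-wise residual-fitting learner of the kind discussed in Section 4. The following is a minimal sketch under stated assumptions, not the paper's implementation: `fit_stump` is a hypothetical single-split base regressor on the first input feature, and `nu` is the fractional shrinkage parameter scaling each hypothesis's contribution.

```python
# Residual fitting with shrinkage: each stage fits the current residuals,
# and its contribution is scaled by a fraction nu before updating them.
import numpy as np

def fit_stump(X, r):
    """Hypothetical base learner: best single-split regression stump on feature 0."""
    best = None
    for thr in np.unique(X[:, 0])[:-1]:       # candidate split points
        left, right = r[X[:, 0] <= thr], r[X[:, 0] > thr]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lo, hi = best
    return lambda X: np.where(X[:, 0] <= thr, lo, hi)

def residual_fit_shrunk(X, y, T, nu=0.5):
    hyps = []
    resid = y.astype(float).copy()
    for _ in range(T):
        f = fit_stump(X, resid)
        hyps.append(f)
        resid -= nu * f(X)                    # shrunken residual update
    return lambda X: sum(nu * f(X) for f in hyps)
```

With nu < 1 each stage corrects only part of the remaining residual, which slows training-error reduction but is the regularization effect credited with improving generalization.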




Publication date: 2000